Internet Message Format | 1996-08-06 | 3KB
Path: newsfeed.direct.ca!usenet
From: qjackson@direct.ca
Newsgroups: comp.lang.c,comp.lang.c++,comp.std.c
Subject: Re: Problem: Parsing Algorithms????????
Date: Thu, 15 Feb 1996 00:23:18 GMT
Organization: Parsepolis Software
Message-ID: <4ftud4$gnt@aphex.direct.ca>
References: <4535421196525ntc@compuserve.com>
Reply-To: qjackson@direct.ca
NNTP-Posting-Host: 204.174.249.1
X-Newsreader: Forte Free Agent 1.0.82
Matthew Dougherty <76477.1267@compuserve.com> wrote:
>I am writing an ANSI C language program to Parse name, address, phone,
>email, and a couple of other fields from text resumes. The idea is to
>have resumes that are emailed to be automatically entered into a database.
Probably the closest things you're going to find in the C world are
lex, awk, sed, or perl. Lex would be the most easily integrated, since
its output is C code, but it suffers from being static in nature (i.e.,
search patterns cannot modify themselves once they have been compiled
by lex into C code).
>For street address there are key words like APT. ST. etc.
>ZipCodes are easy to find and prove because there are limited
>possibilities for states or state codes,
>Phone numbers are easy to find.
>Names are difficult. It's basically positional. It can be different
>every time. The name can be alone on a line or on the same line as phone
>or something.
>Ideas are appreciated.
I am currently working on a <standard> C++ port of LPM, an interpreted
language for pattern matching that I originally implemented in a
non-C/C++ language. It will include C wrappers to allow it to be
called from ANSI C code. The port is ~40% complete now.
LPM allows you to scan a string for patterns rather than string
literals. For instance, to find a legal North American phone number
in a given target string, one would use the rule:
[@(
[@'('$
[3'0-9'#
[@'-)'#
[)
[3'0-9'#
['-'$
[4'0-9'#
This (might) be expressed as the following RE:
(\(?[0-9]{3}[-)]?)?[0-9]{3}-[0-9]{4}
(Yes, yes, what a nightmare!)
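For anyone who wants to try that RE from C, here is a minimal sketch. It assumes a POSIX system that provides <regex.h> (regcomp/regexec are POSIX, not part of ANSI C), and the helper name find_phone is purely hypothetical. It is not LPM and not a complete resume parser; it only shows the RE above being compiled at run time and applied to a target string.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (name is illustrative only).
 * Copies the first phone-number-like substring of `text` into `out`
 * (capacity `outsz`, always NUL-terminated on success).
 * Returns 1 if a match was found, 0 otherwise. */
static int find_phone(const char *text, char *out, size_t outsz)
{
    /* The RE from the post, as a POSIX extended regular expression.
     * Backslashes are doubled because this is a C string literal. */
    static const char pattern[] = "(\\(?[0-9]{3}[-)]?)?[0-9]{3}-[0-9]{4}";
    regex_t re;
    regmatch_t m[1];
    int found = 0;

    if (outsz == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;

    if (regexec(&re, text, 1, m, 0) == 0) {
        /* m[0] holds the byte offsets of the whole match. */
        size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
        if (len >= outsz)
            len = outsz - 1;          /* truncate rather than overflow */
        memcpy(out, text + m[0].rm_so, len);
        out[len] = '\0';
        found = 1;
    }

    regfree(&re);
    return found;
}
```

Calling find_phone("Call (604)555-1234 after 5pm", buf, sizeof buf) with a 32-byte buf should return 1 and leave "(604)555-1234" in buf. Note that the pattern is compiled each time the function runs, so unlike a lex-generated scanner it could be built or altered at run time, at some cost in speed.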
A more robust rule would be required to scan a string for a person's
name, but using LPM, it can be done. (For example, I have a program
that uses LPM to find nouns in a text file based upon their context
within a sentence.)
If you'd like more information on LPM, just email me.
Cheers,
--
                           |
  Parsepolis Software      | Quinn Tyler Jackson
  "ParseCity"              | (aka 'Jamshid')
>--------------------------| qjackson@direct.ca
                           |---------------------->